library(ISLR2)
library(car)
library(mgcv)
library(rgl)
library(splines)
library(pbapply)
In the past two labs we have seen several methods for fitting a smooth regression function \[y_i=f(x_i)+\varepsilon_i, \quad i=1,\ldots,N\] with \(x_i\in \mathbb{R}\). In today's lab we move on to multivariate nonparametric regression, that is, the case \(x_i\in \mathbb{R}^p\) with \(p>1\). We will do so via an additive approach, using Generalized Additive Models. Let us go back to our well-known Prestige dataset, but this time considering the education variable as well.
with(Prestige, scatterplotMatrix(data.frame(prestige, education, income)))
The relationship between prestige and education seems fairly linear. On the contrary, as we have extensively experienced, prestige and income are quite nonlinearly related.
First off, let’s try to fit a multivariate linear model
model_lm=lm(prestige ~ education + income, data=Prestige)
summary(model_lm)
##
## Call:
## lm(formula = prestige ~ education + income, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4040 -5.3308 0.0154 4.9803 17.6889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8477787 3.2189771 -2.127 0.0359 *
## education 4.1374444 0.3489120 11.858 < 2e-16 ***
## income 0.0013612 0.0002242 6.071 2.36e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7939
## F-statistic: 195.6 on 2 and 99 DF, p-value: < 2.2e-16
Building prediction surfaces is as simple as it was in the one-predictor case
education.grid=seq(range(Prestige$education)[1],range(Prestige$education)[2],length.out = 100)
income.grid=seq(range(Prestige$income)[1],range(Prestige$income)[2],length.out = 100)
grid=expand.grid(education.grid,income.grid)
names(grid)=c('education','income')
pred=predict(model_lm,newdata=grid)
persp3d(education.grid,income.grid,pred,col='blue',border="black",lwd=0.3)
with(Prestige, points3d(education,income,prestige,col='black',size=5))
What happens if I add the interaction?
model_lm_interaction <- lm(prestige ~ education + income + education:income, data=Prestige)
# model_lm_interaction <- lm(prestige ~ education * income, data=Prestige) # alternatively
summary(model_lm_interaction)
##
## Call:
## lm(formula = prestige ~ education + income + education:income,
## data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0026 -5.1918 0.3768 4.5915 19.9312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.207e+01 6.581e+00 -3.353 0.001136 **
## education 5.373e+00 5.797e-01 9.270 4.65e-15 ***
## income 3.944e-03 1.007e-03 3.918 0.000165 ***
## education:income -1.961e-04 7.460e-05 -2.628 0.009958 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.587 on 98 degrees of freedom
## Multiple R-squared: 0.8113, Adjusted R-squared: 0.8055
## F-statistic: 140.5 on 3 and 98 DF, p-value: < 2.2e-16
I manage to capture a bit of the nonlinearity with the interaction term… Can we do better?
pred_interaction=predict(model_lm_interaction,newdata=grid)
persp3d(education.grid,income.grid,pred_interaction,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))
We certainly can, by employing Generalized Additive Models! The main reference (which is also surprisingly accessible) is Generalized Additive Models: An Introduction with R by Simon N. Wood. The mgcv package accompanies the book; it is a very well structured and actively maintained package that provides smoothness estimation for a variety of models: we will only scratch the surface today! Recall the analytical expression of a GAM: \[y_i=\beta_0+\sum_{j=1}^{p} f_j(x_{ij})+\varepsilon_i, \quad i=1,\ldots,N.\] It is called an additive model because we calculate a separate \(f_{j}\) for each \(X_{j}\), and then add together all of their contributions. Operatively, gam() is the function of the mgcv package for fitting these types of models. It works very much like the glm function for generalized linear models, but it additionally allows us to specify diverse smoothing options for the model terms.
model_gam=gam(prestige ~ s(education,bs='cr') + s(income,bs='cr'),data = Prestige)
What s() does is essentially build a "smooth" term for each covariate we put in. The behavior is very similar to smooth.spline, which we saw last time: there is no need to set the number of knots, nor (as in the gam package) the equivalent degrees of freedom; mgcv takes care of everything for you. With bs we indicate the (penalized) smoothing basis to use: cr provides a cubic spline basis defined by a modest-sized set of knots spread evenly through the covariate values, penalized by the conventional integrated squared second derivative cubic spline penalty. The fitting is based on penalized likelihood, where the penalty involves the second derivatives of the smooths.
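To peek under the hood, mgcv exposes the basis and penalty construction through smoothCon(); the following sketch (on simulated data, with an illustrative choice of k) shows the two ingredients that gam() assembles for a cr smooth: the basis matrix and the penalty matrix.

```r
library(mgcv)

set.seed(1)
x <- runif(50)
# Construct a cubic regression spline smooth (k = 10 basis functions)
# exactly as gam() would, without fitting anything
sm <- smoothCon(s(x, bs = "cr", k = 10), data = data.frame(x = x))[[1]]

dim(sm$X)       # 50 x 10: the basis (design) matrix evaluated at the data
dim(sm$S[[1]])  # 10 x 10: the integrated squared second derivative penalty
sm$xp           # the knots (stored in xp for cr smooths), spread through the data
```

The penalized fit then minimizes \(\|y - X\beta\|^2 + \lambda\, \beta^\top S \beta\), with \(\lambda\) chosen automatically (by GCV/REML).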
Let us look at the model's summary
summary(model_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prestige ~ s(education, bs = "cr") + s(income, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.8333 0.6884 68.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(education) 3.154 3.913 39.22 <2e-16 ***
## s(income) 2.996 3.604 16.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.837 Deviance explained = 84.7%
## GCV = 51.984 Scale est. = 48.34 n = 102
If you are familiar with the output of a glm object, the one above should not surprise you that much. The only slightly exotic part may be the "approximate significance of smooth terms": the effective degrees of freedom (edf) is a summary statistic of a GAM that reflects the degree of nonlinearity of each estimated curve (an edf close to 1 indicates an essentially linear effect). Notice that in this case the adjusted R-squared is higher than for the linear models, even without the interaction term.
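To see what the edf measures, here is a small sketch on simulated data (names are illustrative): when the true relationship is linear, the penalization shrinks the smooth towards a straight line and the edf approaches 1; a genuinely wiggly signal yields a much larger edf.

```r
library(mgcv)

set.seed(42)
x <- runif(200)
y_lin  <- 2 * x + rnorm(200, sd = 0.1)           # truly linear signal
y_curv <- sin(2 * pi * x) + rnorm(200, sd = 0.1) # strongly nonlinear signal

fit_lin  <- gam(y_lin ~ s(x, bs = "cr"))
fit_curv <- gam(y_curv ~ s(x, bs = "cr"))

summary(fit_lin)$edf   # close to 1: essentially a straight line
summary(fit_curv)$edf  # much larger: a genuinely wiggly curve
```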
Let us look at the residuals:
hist(model_gam$residuals)
qqnorm(model_gam$residuals)
Normality test:
shapiro.test(model_gam$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_gam$residuals
## W = 0.98887, p-value = 0.5603
Not bad at all! Notice that mgcv works with smoothing bases; nevertheless, the beauty of GAMs is that we can use any type of univariate smoother as a building block for fitting an additive model. We may, for example, wish to employ natural cubic splines to provide the nonparametric components in our GAMs. In this case, we do not actually need anything new other than the good old lm function and the basis matrix generator contained in the splines package.
model_gam_ns <-
lm(prestige ~ ns(education, df = 3) + ns(income, df = 3), data = Prestige)
In most situations, the differences between the GAMs obtained using smoothing splines versus natural splines are small. Let us look at the residuals scatterplot for the two models:
plot(model_gam_ns$residuals,model_gam$residuals)
cor(model_gam_ns$residuals,model_gam$residuals)
## [1] 0.988401
Pretty much the same fit!
GAMs make it easy to interpret the effect of each regressor on the response variable. Again, it works very much like the analysis of a linear model: we look at the contribution of a single predictor holding all the others fixed. This is best accomplished graphically with the plot method conveniently provided by the mgcv package.
plot(model_gam)
The same can be achieved for the natural splines model, with a tiny workaround (programming tip!)
gam::plot.Gam(model_gam_ns, se=TRUE)
Again, we see that there are basically no differences between model_gam and model_gam_ns.
We noticed at the very beginning that the relationship between prestige and education seems fairly linear. We now test whether a smooth function is needed for education, or whether we can be satisfied with a linear contribution. The reduced model reads:
model_gam_reduced=gam(prestige ~ education + s(income,bs='cr'),data = Prestige)
summary(model_gam_reduced)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prestige ~ education + s(income, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1342 3.7355 1.107 0.271
## education 3.9764 0.3415 11.644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(income) 3.353 4.012 15.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.825 Deviance explained = 83.3%
## GCV = 54.563 Scale est. = 51.7 n = 102
The linear contribution of education is highly significant, but which model is better? Let us perform a formal test, as we would for nested model selection with lm: the anova function serves the purpose.
anova(model_gam_reduced,model_gam, test = "F")
Of course, we could also have hard-coded it:
N <- nrow(Prestige)
RSS_reduced <- deviance(model_gam_reduced)
RSS_full <- deviance(model_gam)
df_full <- N-sum(summary(model_gam)$s.table[,2])-1 # -1 for the intercept
df_reduced <- N-sum(summary(model_gam_reduced)$s.table[,2])-2 # -2 for the intercept and the linear contribution of education
df_difference <- df_reduced-df_full
F_value_manual <- ((RSS_reduced-RSS_full)/df_difference)/(RSS_full/summary(model_gam)$residual.df)
pf(F_value_manual, df1 = df_difference,df2 = df_full,lower.tail = FALSE)
## [1] 0.02791939
It seems that a nonlinear component for education is needed after all. Important: if my residuals were not normal, I would have needed to perform a non-parametric test, but you now know how to do it, right? More on this later.
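As a teaser, here is one possible non-parametric version of the nested comparison: a bootstrap on the residuals of the reduced model, refitting both models on each resample and recomputing the F statistic. This is a sketch on simulated, deliberately non-Gaussian data; all names and the choice of B are illustrative.

```r
library(mgcv)

set.seed(123)
n  <- 150
x1 <- runif(n)
x2 <- runif(n)
# Skewed (centered exponential) errors: the parametric F test is no longer exact
y  <- sin(2 * pi * x1) + x2 + (rexp(n) - 1)
d  <- data.frame(y, x1, x2)

fit_full    <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr"), data = d)
fit_reduced <- gam(y ~ x1 + s(x2, bs = "cr"), data = d)  # H0: x1 enters linearly
F_obs <- anova(fit_reduced, fit_full, test = "F")$F[2]

# Residual bootstrap under the reduced model: the null distribution of F
B <- 100
F_boot <- replicate(B, {
  d_b   <- d
  d_b$y <- fitted(fit_reduced) + sample(residuals(fit_reduced), n, replace = TRUE)
  f_f <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr"), data = d_b)
  f_r <- gam(y ~ x1 + s(x2, bs = "cr"), data = d_b)
  anova(f_r, f_f, test = "F")$F[2]
})
(p_boot <- mean(F_boot >= F_obs))  # bootstrap p-value
```

Here the true effect of x1 is nonlinear, so the bootstrap p-value should come out small, agreeing with the parametric test without relying on normality.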
Performing prediction for GAMs is as simple as it always has been:
pred_gam=predict(model_gam,newdata=grid)
persp3d(education.grid,income.grid,pred_gam,col='grey30')
with(Prestige,points3d(education,income,prestige,col='black',size=5))
We can include interactions in a GAM as well: we could either employ a simpler approach (the one we will subsequently use) by defining an interaction term and smoothing it, or we could employ more sophisticated bidimensional splines, such as thin plate splines (Wood, 2003). In general, it is better to consider bivariate splines when the two dimensions are tightly related (e.g., spatial coordinates).
model_gam_inter = gam(prestige ~ s(education, bs = 'cr') +
s(income, bs ='cr') +
s(I(income * education), bs = 'cr'),
data = Prestige)
# model_gam_inter=gam(prestige ~ s(education,bs='cr') + s(income,bs='cr')+ s(education,income,bs='tp'),data = Prestige) # thin plate spline
pred_inter = predict(model_gam_inter, newdata = grid) # I() recomputes income * education from grid, no extra column needed
persp3d(education.grid, income.grid, pred_inter, col = 'grey30')
with(Prestige,
points3d(education, income, prestige, col = 'black', size = 5))
Dr. Simoni, a data scientist, just had his second kid (a boy, 58 cm tall) and he is interested in knowing how tall his child will be at 25 years of age, given his height at birth. To do so, he has collected the height at birth of 100 males, alongside the height they reached when they were 25. These data are stored in the file boyheight.rda: height.25 is the height [cm] at 25 years of age, while height.b is the length of the newborn [cm]. Assume that \[height.25 = f(height.b) + \varepsilon.\]
Build a degree 1 regression spline model, with breaks at the 25th and 75th percentiles of height.b, to predict the height at 25 from the height at birth. Provide a plot of the regression line, compute the pointwise prediction (rounded to the second decimal digit) for the height at 25 of Dr. Simoni's newborn child, and calculate, using a bootstrap approach on the residuals, the bias, variance and Mean Squared Error (MSE) of such prediction.
Dr. Simoni is not particularly satisfied with the predictions of such a simple model, and would like you to do more. So build a prediction model for the height at 25 based on a smoothing spline of order 4 (select the lambda parameter via Leave-One-Out CV). Report the optimal lambda value (2 decimal digits), provide a plot of the regression line, alongside the point-wise prediction of the height at 25 of Dr. Simoni’s kid, and calculate using a bootstrap approach on the residuals the bias, variance and MSE of such prediction (fix the lambda value to the one obtained via Leave-One-Out CV).
load(here::here("Block III - Nonparametric regression/data/boyheight.rda"))
plot(x=height.b, y=height.25)
# Breaks at the 25th and 75th percentile of *height.b*
knots_heigth <- quantile(height.b,probs = c(.25,.75))
# Build a degree 1 regression spline model
fit_spline <- lm(height.25~bs(height.b,knots = knots_heigth,degree = 1))
new_data_seq <- seq(min(height.b),max(height.b),length.out=100)
# Predict the height at 25 from the height at birth
preds <- predict(fit_spline, newdata = list(height.b = new_data_seq), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(new_data_seq, preds$fit, lwd = 2, col = "blue")
matlines(new_data_seq, se.bands, lwd = 1, col = "blue", lty = 3)
# Pointwise prediction
simoni_weigth <- 58
(pointwise_pred <- predict(fit_spline, newdata = list(height.b=simoni_weigth)))
## 1
## 191.4279
# Bias, variance and Mean Squared Error (MSE) of pointwise_pred
wrapper_f <- function(){
height.25.boot=fitted+sample(residuals,n,replace=T)
new_model=lm(height.25.boot ~ bs(height.b, knots=knots_heigth,degree=1))
predict(new_model, newdata = list(height.b=simoni_weigth))
}
B <- 200
fitted=fit_spline$fitted.values
residuals=fit_spline$residuals
n <- length(height.25)
set.seed(100)
boot_d <- pbreplicate(n = B,expr = wrapper_f(),simplify = "vector")
(variance_pred <- var(boot_d))
## [1] 1.158379
(bias_pred <- pointwise_pred-mean(boot_d))
## 1
## 0.04960756
(MSE_pred <- variance_pred +bias_pred^2)
## 1
## 1.16084
# smoothing spline of order 4 (cv=TRUE for LOOCV)
fit_smooth <- smooth.spline(x = height.b, y=height.25,cv = TRUE)
# optimal lambda value
round(fit_smooth$lambda,2)
## [1] 0.01
# plot of the regression line
plot(y = height.25, x = height.b, cex = .5, col = "darkgrey")
lines(fit_smooth$x, fit_smooth$y, col="blue", lwd=2)
fitted_smooth <- predict(fit_smooth, height.b)$y
fitted_res <- height.25-fitted_smooth
pred_smooth_simoni <- predict(fit_smooth, simoni_weigth)
# pointwise prediction of the height at 25 of Dr. Simoni’s kid
pred_smooth_simoni$y
## [1] 191.9669
boot_smooth <- numeric(B)
set.seed(100)
for (b in 1:B) {  # avoid calling the loop index "sample": it shadows base::sample
  height.25.boot <- fitted_smooth + sample(fitted_res, n, replace = TRUE)
  new_model <- smooth.spline(y = height.25.boot, x = height.b, lambda = fit_smooth$lambda)
  boot_smooth[b] <- predict(new_model, simoni_weigth)$y
}
(variance_pred_smooth <- var(boot_smooth))
## [1] 0.9159435
(bias_pred_smooth <- pred_smooth_simoni$y-mean(boot_smooth))
## [1] -0.3520232
(MSE_pred_smooth <- variance_pred_smooth +bias_pred_smooth^2)
## [1] 1.039864
Dr. Bisacciny, Ph.D., is becoming increasingly worried about the fact that the bees he decided to place near a corn field close to the Italian city of Milan tend to die way earlier than the ones placed close to the small village of Jovençan, in Aosta Valley. He suspects that additional factors influencing the survival time of a bee are its productivity and its weight. For this reason, he decides to run an experiment in which he selects 10,000 bees, placing 5,000 of them in Aosta Valley and 5,000 in Milan. Dr. Bisacciny runs the experiment for 50 days, during which he annotates when a bee passes away. After the end of the experiment, he starts to analyse the data. In the file ex02.rda you can find a dataframe with information about the status of the bee (1 if alive, 2 if dead), the survival time of the bee (surv.time), its weight (weight) and productivity (prod), as well as its location (Jovençan or Milan). We model these data as i.i.d. realizations of a four-dimensional random variable:
load(here::here("Block III - Nonparametric regression/data/ex02.rda"))
head(bee)
fit_gam <- gam(log(surv.time)~s(weight,bs = "cr")+s(prod,bs = "cr")+location,data = bee)
The estimated model reads
\[\begin{equation} \log(surv.time) =\beta_{0}+f_{1}\left(weight\right)+f_{2}\left(prod\right)+\beta_1\, location_{Milan}+\epsilon \end{equation}\]
table_fit_gam <- summary(fit_gam)
# adjusted R2
(r_2_squared <- table_fit_gam$r.sq)
## [1] 0.4569872
# p-values of the tests (parametric coefficients)
table_fit_gam$p.pv
## (Intercept) locationMilan
## 0 0
# p-values of the tests (smooth terms)
table_fit_gam$s.pv
## [1] 0.1002523 0.4254312
plot(fit_gam)
hist(fit_gam$residuals)
# Remove weight
fit_gam_reduced_1 <- gam(log(surv.time)~s(prod,bs = "cr")+location,data = bee)
anova(fit_gam_reduced_1,fit_gam, test = "F")
# Remove prod
fit_gam_reduced_2 <- gam(log(surv.time)~location,data = bee)
anova(fit_gam_reduced_2, fit_gam_reduced_1, test = "F")
I can use a simple linear model with just one term, location. No need for the gam machinery anymore.
fit_lm <- lm(log(surv.time)~location,data = bee)
# Avg log surv time in Milan
sum(fit_lm$coefficients)
## [1] 1.168164
# mean(log(bee$surv.time[bee$location=="Milan"])) # alternatively
# Avg log surv time in Jovencan
fit_lm$coefficients[1]
## (Intercept)
## 3.010891
# mean(log(bee$surv.time[bee$location=="Jovencan"])) # alternatively
A more concise way, using tapply and a functional programming style:
tapply(log(bee$surv.time), bee$location, mean)
## Jovencan Milan
## 3.010891 1.168164